Heart disease refers to several types of heart conditions and it is the number one cause of death worldwide. To prevent heart disease, we must first learn how to reliably detect it. The heart disease data used in this study has various measurements on patients health and cardiovascular statistics.
Source: UChicago Medicine
Source of data
This study used dataset from a study of heart disease that has been open to the public at the UCI Machine Learning Repository which is being maintained by the Center for Machine Learning and Intelligent Systems at the University of California, Irvine.
Import packages
We shall use different set of packages in R
- Data preparation and exploration
- Machine learning packages
- Table formating
#Import packages
data_exploration_packages <- c("tidyverse", "plotly", "openxlsx")
machine_learning_packages <- c("caret","MASS", "car", "kernlab","rpart","randomForest","class","ada", "rda","e1071", "nnet","ipred", "dbarts", "klaR", "glmnet", 'earth')
table_formating_packages <- c("knitr","kableExtra")
if(!require(install.load)){
install.packages("install.load")
}
install.load::install_load(c(data_exploration_packages, machine_learning_packages, table_formating_packages))
Load and prepare the dataset
As a first step we must load the dataset.
Overview of heart disease dataset
Heart disease data
|
patient_id
|
slope_of_peak_exercise_st_segment
|
thal
|
resting_blood_pressure
|
chest_pain_type
|
num_major_vessels
|
fasting_blood_sugar_gt_120_mg_per_dl
|
resting_ekg_results
|
serum_cholesterol_mg_per_dl
|
oldpeak_eq_st_depression
|
sex
|
age
|
max_heart_rate_achieved
|
exercise_induced_angina
|
heart_disease_present
|
|
0z64un
|
1
|
normal
|
128
|
2
|
0
|
0
|
2
|
308
|
0.0
|
1
|
45
|
170
|
0
|
0
|
|
ryoo3j
|
2
|
normal
|
110
|
3
|
0
|
0
|
0
|
214
|
1.6
|
0
|
54
|
158
|
0
|
0
|
|
yt1s1x
|
1
|
normal
|
125
|
4
|
3
|
0
|
2
|
304
|
0.0
|
1
|
77
|
162
|
1
|
1
|
|
l2xjde
|
1
|
reversible_defect
|
152
|
4
|
0
|
0
|
0
|
223
|
0.0
|
1
|
40
|
181
|
0
|
1
|
|
oyt4ek
|
3
|
reversible_defect
|
178
|
1
|
0
|
0
|
2
|
270
|
4.2
|
1
|
59
|
145
|
0
|
0
|
|
ldukkw
|
1
|
normal
|
130
|
3
|
0
|
0
|
0
|
180
|
0.0
|
1
|
42
|
150
|
0
|
0
|
|
2gbyh9
|
2
|
reversible_defect
|
150
|
4
|
2
|
0
|
2
|
258
|
2.6
|
0
|
60
|
157
|
0
|
1
|
|
daa9kp
|
2
|
fixed_defect
|
150
|
4
|
1
|
0
|
2
|
276
|
0.6
|
1
|
57
|
112
|
1
|
1
|
|
3nwy2n
|
3
|
reversible_defect
|
170
|
4
|
0
|
0
|
2
|
326
|
3.4
|
1
|
59
|
140
|
1
|
1
|
|
1r508r
|
2
|
normal
|
120
|
3
|
0
|
0
|
0
|
219
|
1.6
|
0
|
50
|
158
|
0
|
0
|
|
ldg4b9
|
2
|
normal
|
120
|
4
|
0
|
0
|
2
|
302
|
0.4
|
1
|
66
|
151
|
0
|
0
|
|
xc17yq
|
1
|
normal
|
140
|
4
|
0
|
0
|
0
|
226
|
0.0
|
1
|
42
|
178
|
0
|
0
|
|
mpggsq
|
1
|
normal
|
140
|
3
|
0
|
0
|
0
|
335
|
0.0
|
1
|
64
|
158
|
0
|
1
|
|
zlyac8
|
2
|
normal
|
138
|
4
|
0
|
0
|
2
|
236
|
0.2
|
0
|
45
|
152
|
1
|
0
|
|
f06u72
|
2
|
reversible_defect
|
120
|
1
|
0
|
0
|
0
|
231
|
3.8
|
1
|
38
|
182
|
1
|
1
|
|
2fv3rc
|
2
|
reversible_defect
|
144
|
4
|
0
|
0
|
2
|
200
|
0.9
|
1
|
50
|
126
|
1
|
1
|
|
qyrkxn
|
2
|
normal
|
130
|
2
|
0
|
0
|
2
|
234
|
0.6
|
0
|
45
|
175
|
0
|
0
|
|
237mql
|
1
|
reversible_defect
|
130
|
4
|
1
|
0
|
0
|
253
|
1.4
|
1
|
60
|
144
|
1
|
1
|
|
mc750a
|
1
|
normal
|
130
|
2
|
0
|
0
|
2
|
204
|
0.0
|
1
|
29
|
202
|
0
|
0
|
|
30v796
|
1
|
normal
|
136
|
2
|
2
|
1
|
2
|
319
|
0.0
|
0
|
58
|
152
|
0
|
1
|
|
cvux3j
|
1
|
normal
|
160
|
2
|
2
|
0
|
0
|
302
|
0.4
|
0
|
71
|
162
|
0
|
0
|
|
k8899q
|
1
|
reversible_defect
|
108
|
4
|
3
|
1
|
0
|
233
|
0.1
|
1
|
52
|
147
|
0
|
0
|
|
jhdvtb
|
1
|
normal
|
106
|
4
|
2
|
0
|
0
|
223
|
0.3
|
0
|
67
|
142
|
0
|
0
|
|
5g9v0h
|
1
|
fixed_defect
|
160
|
4
|
0
|
0
|
2
|
228
|
2.3
|
1
|
66
|
138
|
0
|
0
|
|
83asqd
|
1
|
normal
|
156
|
2
|
0
|
0
|
2
|
245
|
0.0
|
1
|
70
|
143
|
0
|
0
|
|
gla0im
|
2
|
normal
|
120
|
3
|
0
|
0
|
2
|
211
|
1.5
|
0
|
68
|
115
|
0
|
0
|
|
zzmfh7
|
1
|
normal
|
128
|
4
|
1
|
0
|
2
|
303
|
0.0
|
0
|
57
|
159
|
0
|
0
|
|
f4g1ay
|
1
|
normal
|
128
|
2
|
0
|
1
|
0
|
205
|
0.0
|
1
|
52
|
184
|
0
|
0
|
|
lek9q9
|
2
|
normal
|
140
|
3
|
0
|
0
|
2
|
185
|
3.0
|
1
|
60
|
155
|
0
|
1
|
|
8265rl
|
1
|
normal
|
110
|
3
|
0
|
0
|
0
|
175
|
0.6
|
1
|
51
|
123
|
0
|
0
|
|
6017a1
|
2
|
normal
|
130
|
3
|
0
|
0
|
2
|
214
|
2.0
|
1
|
41
|
168
|
0
|
0
|
|
z7xkou
|
2
|
reversible_defect
|
150
|
4
|
3
|
0
|
2
|
225
|
1.0
|
0
|
65
|
114
|
0
|
1
|
|
k7ef7h
|
3
|
reversible_defect
|
140
|
4
|
0
|
1
|
2
|
203
|
3.1
|
1
|
53
|
155
|
1
|
1
|
|
0n5fu0
|
1
|
normal
|
180
|
4
|
0
|
0
|
0
|
325
|
0.0
|
0
|
64
|
154
|
1
|
0
|
|
55xksg
|
2
|
reversible_defect
|
112
|
3
|
1
|
0
|
2
|
230
|
2.5
|
1
|
58
|
165
|
0
|
1
|
|
pjgqa3
|
1
|
normal
|
122
|
4
|
0
|
0
|
2
|
222
|
0.0
|
1
|
48
|
186
|
0
|
0
|
|
xkdz7j
|
1
|
reversible_defect
|
150
|
3
|
1
|
1
|
0
|
126
|
0.2
|
1
|
57
|
173
|
0
|
0
|
|
tpuevg
|
1
|
normal
|
124
|
4
|
0
|
0
|
0
|
209
|
0.0
|
0
|
62
|
163
|
0
|
0
|
|
ascl42
|
1
|
normal
|
120
|
2
|
1
|
0
|
2
|
269
|
0.2
|
0
|
74
|
121
|
1
|
0
|
|
1xwoe6
|
1
|
reversible_defect
|
128
|
4
|
1
|
0
|
0
|
255
|
0.0
|
1
|
52
|
161
|
1
|
1
|
|
ty4ik8
|
2
|
normal
|
150
|
3
|
0
|
1
|
0
|
243
|
1.0
|
1
|
61
|
137
|
1
|
0
|
|
gx6yxl
|
1
|
normal
|
135
|
3
|
0
|
0
|
2
|
252
|
0.0
|
0
|
63
|
172
|
0
|
0
|
|
hlmts5
|
1
|
normal
|
110
|
3
|
1
|
1
|
2
|
265
|
0.0
|
0
|
71
|
130
|
0
|
0
|
|
yx0q6k
|
1
|
normal
|
140
|
3
|
1
|
1
|
2
|
417
|
0.8
|
0
|
65
|
157
|
0
|
0
|
|
ep1o51
|
1
|
normal
|
108
|
3
|
0
|
0
|
2
|
267
|
0.0
|
0
|
54
|
167
|
0
|
0
|
|
gj1e5z
|
1
|
reversible_defect
|
124
|
2
|
0
|
0
|
0
|
261
|
0.3
|
1
|
57
|
141
|
0
|
1
|
|
6219kl
|
2
|
normal
|
125
|
3
|
0
|
1
|
2
|
245
|
2.4
|
1
|
51
|
166
|
0
|
0
|
|
rp9g6x
|
2
|
normal
|
112
|
4
|
0
|
0
|
0
|
149
|
1.6
|
0
|
71
|
125
|
0
|
0
|
|
1aeaff
|
2
|
reversible_defect
|
120
|
2
|
1
|
0
|
2
|
281
|
1.4
|
1
|
62
|
103
|
0
|
1
|
|
y3prof
|
1
|
normal
|
105
|
2
|
0
|
0
|
0
|
204
|
0.0
|
0
|
46
|
172
|
0
|
0
|
|
3drd48
|
2
|
reversible_defect
|
140
|
4
|
2
|
0
|
0
|
311
|
1.8
|
1
|
46
|
120
|
1
|
1
|
|
ejo7p3
|
1
|
normal
|
130
|
3
|
1
|
0
|
0
|
315
|
1.9
|
1
|
43
|
162
|
0
|
0
|
|
r7y4i1
|
1
|
reversible_defect
|
130
|
4
|
3
|
1
|
2
|
330
|
1.8
|
1
|
63
|
132
|
1
|
1
|
|
mznwxv
|
2
|
fixed_defect
|
130
|
3
|
1
|
1
|
2
|
256
|
0.6
|
1
|
56
|
142
|
1
|
1
|
|
27oevk
|
1
|
normal
|
130
|
4
|
0
|
0
|
2
|
330
|
0.0
|
0
|
61
|
169
|
0
|
1
|
|
jwqi3k
|
1
|
normal
|
130
|
3
|
0
|
0
|
0
|
233
|
0.4
|
1
|
44
|
179
|
1
|
0
|
|
328lkl
|
2
|
reversible_defect
|
110
|
4
|
1
|
0
|
0
|
239
|
2.8
|
1
|
54
|
126
|
1
|
1
|
|
tlk9o8
|
2
|
reversible_defect
|
120
|
4
|
2
|
0
|
0
|
267
|
1.8
|
1
|
62
|
99
|
1
|
1
|
|
aq2vrq
|
1
|
normal
|
120
|
2
|
0
|
0
|
0
|
295
|
0.0
|
1
|
42
|
162
|
0
|
0
|
|
ilogfb
|
1
|
normal
|
110
|
4
|
1
|
0
|
2
|
197
|
0.0
|
1
|
44
|
177
|
0
|
1
|
|
rv6siv
|
2
|
reversible_defect
|
115
|
3
|
0
|
0
|
2
|
564
|
1.6
|
0
|
67
|
160
|
0
|
0
|
|
m2a4i9
|
2
|
reversible_defect
|
130
|
4
|
0
|
0
|
0
|
305
|
1.2
|
0
|
51
|
142
|
1
|
1
|
|
pwigd8
|
3
|
reversible_defect
|
130
|
4
|
0
|
1
|
2
|
283
|
1.6
|
1
|
56
|
103
|
1
|
1
|
|
qwapdq
|
2
|
normal
|
112
|
2
|
0
|
0
|
0
|
160
|
0.0
|
0
|
45
|
138
|
0
|
0
|
|
4sd1xn
|
1
|
normal
|
110
|
4
|
0
|
0
|
2
|
254
|
0.0
|
0
|
50
|
159
|
0
|
0
|
|
nck22c
|
1
|
reversible_defect
|
126
|
4
|
0
|
0
|
2
|
282
|
0.0
|
1
|
35
|
156
|
1
|
1
|
|
m6zksp
|
2
|
normal
|
130
|
4
|
3
|
0
|
2
|
322
|
2.4
|
1
|
70
|
109
|
0
|
1
|
|
f70grj
|
2
|
normal
|
115
|
4
|
0
|
0
|
0
|
303
|
1.2
|
1
|
43
|
181
|
0
|
0
|
|
k1art8
|
2
|
normal
|
135
|
2
|
0
|
0
|
2
|
250
|
1.4
|
0
|
55
|
161
|
0
|
0
|
|
mcwqgs
|
2
|
reversible_defect
|
120
|
3
|
3
|
0
|
0
|
188
|
2.0
|
1
|
49
|
139
|
0
|
1
|
|
3jsjqk
|
1
|
normal
|
120
|
2
|
0
|
0
|
0
|
220
|
0.0
|
1
|
44
|
170
|
0
|
0
|
|
ik7hfs
|
1
|
normal
|
112
|
4
|
0
|
0
|
0
|
204
|
0.1
|
1
|
47
|
143
|
0
|
0
|
|
qwj1yf
|
1
|
reversible_defect
|
172
|
3
|
0
|
1
|
0
|
199
|
0.5
|
1
|
52
|
162
|
0
|
0
|
|
qvhk9e
|
1
|
normal
|
120
|
3
|
0
|
0
|
0
|
215
|
0.0
|
0
|
37
|
170
|
0
|
0
|
|
igwnqo
|
2
|
fixed_defect
|
126
|
3
|
1
|
1
|
0
|
218
|
2.2
|
1
|
59
|
134
|
0
|
1
|
|
4v0q7o
|
2
|
reversible_defect
|
178
|
4
|
2
|
1
|
0
|
228
|
1.0
|
0
|
66
|
165
|
1
|
1
|
|
hh2awp
|
2
|
normal
|
136
|
3
|
0
|
0
|
2
|
196
|
0.1
|
0
|
52
|
169
|
0
|
0
|
|
vfjppl
|
2
|
reversible_defect
|
124
|
4
|
1
|
0
|
2
|
266
|
2.2
|
1
|
54
|
109
|
1
|
1
|
|
6lu42b
|
2
|
reversible_defect
|
145
|
4
|
2
|
0
|
2
|
282
|
2.8
|
1
|
60
|
142
|
1
|
1
|
|
shiro4
|
2
|
reversible_defect
|
130
|
4
|
1
|
0
|
2
|
254
|
1.4
|
1
|
63
|
147
|
0
|
1
|
|
3wl3z4
|
3
|
normal
|
130
|
3
|
0
|
1
|
2
|
197
|
1.2
|
1
|
53
|
152
|
0
|
0
|
|
ebioez
|
2
|
reversible_defect
|
120
|
4
|
1
|
0
|
0
|
188
|
1.4
|
1
|
54
|
113
|
0
|
1
|
|
37c0vm
|
3
|
reversible_defect
|
110
|
2
|
0
|
0
|
0
|
229
|
1.0
|
1
|
48
|
168
|
0
|
1
|
|
v52zcs
|
2
|
reversible_defect
|
128
|
4
|
2
|
0
|
2
|
259
|
3.0
|
1
|
58
|
130
|
1
|
1
|
|
6nkcaw
|
1
|
normal
|
130
|
3
|
0
|
0
|
2
|
256
|
0.5
|
0
|
51
|
149
|
0
|
0
|
|
hfp05i
|
1
|
normal
|
118
|
3
|
3
|
0
|
2
|
149
|
0.8
|
1
|
49
|
126
|
0
|
1
|
|
grfxwd
|
1
|
normal
|
112
|
3
|
0
|
0
|
2
|
268
|
0.0
|
0
|
41
|
172
|
1
|
0
|
|
bvcxah
|
1
|
reversible_defect
|
140
|
4
|
1
|
0
|
0
|
177
|
0.0
|
1
|
59
|
162
|
1
|
1
|
|
i49srr
|
1
|
normal
|
150
|
3
|
0
|
0
|
0
|
168
|
1.6
|
1
|
57
|
174
|
0
|
0
|
|
93dbhq
|
1
|
normal
|
130
|
2
|
0
|
0
|
0
|
262
|
0.0
|
1
|
55
|
155
|
0
|
0
|
|
jscmp8
|
2
|
normal
|
134
|
2
|
0
|
0
|
0
|
271
|
0.0
|
0
|
49
|
162
|
0
|
0
|
|
zaytyf
|
2
|
normal
|
100
|
4
|
2
|
0
|
2
|
299
|
0.9
|
1
|
67
|
125
|
1
|
1
|
|
wze8qm
|
1
|
normal
|
135
|
3
|
0
|
1
|
0
|
304
|
0.0
|
0
|
54
|
170
|
0
|
0
|
|
w3933i
|
2
|
reversible_defect
|
140
|
4
|
2
|
0
|
2
|
293
|
1.2
|
1
|
60
|
170
|
0
|
1
|
|
7uch9x
|
2
|
normal
|
108
|
3
|
0
|
0
|
0
|
141
|
0.6
|
0
|
44
|
175
|
0
|
0
|
|
dy5hxt
|
1
|
reversible_defect
|
118
|
3
|
1
|
0
|
0
|
277
|
1.0
|
1
|
68
|
151
|
0
|
0
|
|
c0gkqc
|
2
|
fixed_defect
|
145
|
4
|
2
|
0
|
2
|
212
|
2.0
|
1
|
64
|
132
|
0
|
1
|
|
z5g5p3
|
2
|
normal
|
160
|
1
|
1
|
1
|
2
|
234
|
0.1
|
1
|
69
|
131
|
0
|
0
|
|
h3uzv8
|
1
|
normal
|
155
|
3
|
0
|
0
|
0
|
269
|
0.8
|
0
|
65
|
148
|
0
|
0
|
|
bthqr4
|
1
|
normal
|
150
|
1
|
0
|
1
|
2
|
283
|
1.0
|
0
|
58
|
162
|
0
|
0
|
|
rfj25e
|
1
|
normal
|
140
|
3
|
0
|
0
|
2
|
321
|
0.0
|
1
|
39
|
182
|
0
|
0
|
|
9f92et
|
2
|
normal
|
140
|
2
|
0
|
0
|
2
|
294
|
1.3
|
0
|
56
|
153
|
0
|
0
|
|
24fopx
|
2
|
reversible_defect
|
110
|
4
|
1
|
0
|
2
|
239
|
1.2
|
1
|
59
|
142
|
1
|
1
|
|
ldr1mz
|
1
|
normal
|
140
|
3
|
1
|
0
|
2
|
308
|
1.5
|
0
|
51
|
142
|
0
|
0
|
|
wokyol
|
1
|
reversible_defect
|
140
|
3
|
0
|
0
|
0
|
313
|
0.2
|
0
|
64
|
133
|
0
|
0
|
|
p5orwa
|
2
|
normal
|
130
|
4
|
2
|
0
|
0
|
303
|
2.0
|
0
|
64
|
122
|
0
|
0
|
|
s8dx1q
|
1
|
reversible_defect
|
150
|
3
|
0
|
0
|
2
|
232
|
1.6
|
1
|
54
|
165
|
0
|
0
|
|
7kf275
|
2
|
reversible_defect
|
160
|
4
|
1
|
0
|
2
|
289
|
0.8
|
1
|
55
|
145
|
1
|
1
|
|
e68djo
|
1
|
normal
|
125
|
1
|
1
|
0
|
2
|
213
|
1.4
|
1
|
51
|
125
|
1
|
0
|
|
3ze7pv
|
2
|
reversible_defect
|
124
|
4
|
0
|
0
|
2
|
274
|
0.5
|
1
|
48
|
166
|
0
|
1
|
|
0g192k
|
2
|
reversible_defect
|
128
|
4
|
1
|
0
|
0
|
263
|
0.2
|
1
|
64
|
105
|
1
|
0
|
|
3s141s
|
1
|
normal
|
120
|
2
|
0
|
0
|
0
|
244
|
1.1
|
0
|
50
|
162
|
0
|
0
|
|
6r9x2j
|
2
|
reversible_defect
|
140
|
4
|
3
|
0
|
0
|
298
|
4.2
|
1
|
51
|
122
|
1
|
1
|
|
sqddbc
|
2
|
reversible_defect
|
180
|
3
|
0
|
1
|
2
|
274
|
1.6
|
1
|
68
|
150
|
1
|
1
|
|
nizd9c
|
1
|
normal
|
112
|
3
|
0
|
0
|
0
|
250
|
0.0
|
1
|
41
|
179
|
0
|
0
|
|
lpub9d
|
1
|
normal
|
140
|
3
|
0
|
1
|
2
|
211
|
0.0
|
1
|
58
|
165
|
0
|
0
|
|
bv01fp
|
2
|
fixed_defect
|
135
|
2
|
0
|
0
|
0
|
203
|
0.0
|
1
|
41
|
132
|
0
|
0
|
|
9dqkpy
|
1
|
reversible_defect
|
110
|
4
|
0
|
0
|
2
|
172
|
0.0
|
1
|
41
|
158
|
0
|
1
|
|
2fqzg8
|
2
|
reversible_defect
|
132
|
4
|
1
|
0
|
0
|
353
|
1.2
|
1
|
55
|
132
|
1
|
1
|
|
1jruhz
|
2
|
normal
|
138
|
4
|
3
|
1
|
0
|
294
|
1.9
|
0
|
62
|
106
|
0
|
1
|
|
ju1wdc
|
2
|
normal
|
138
|
1
|
1
|
1
|
2
|
282
|
1.4
|
1
|
65
|
174
|
0
|
1
|
|
f4n8ny
|
1
|
normal
|
118
|
2
|
0
|
0
|
0
|
210
|
0.7
|
0
|
34
|
192
|
0
|
0
|
|
97v1yz
|
2
|
fixed_defect
|
140
|
4
|
0
|
0
|
0
|
192
|
0.4
|
1
|
57
|
148
|
0
|
0
|
|
6jcc1y
|
1
|
normal
|
130
|
3
|
3
|
1
|
2
|
246
|
0.0
|
1
|
53
|
173
|
0
|
0
|
|
tbo0wx
|
2
|
normal
|
160
|
4
|
3
|
0
|
2
|
286
|
1.5
|
1
|
67
|
108
|
1
|
1
|
|
4b32pd
|
1
|
normal
|
160
|
3
|
0
|
0
|
2
|
360
|
0.8
|
0
|
65
|
151
|
0
|
0
|
|
0ryxtv
|
2
|
normal
|
102
|
4
|
0
|
0
|
2
|
265
|
0.6
|
0
|
42
|
122
|
0
|
0
|
|
w1wgrq
|
2
|
reversible_defect
|
120
|
3
|
0
|
0
|
2
|
258
|
0.4
|
1
|
54
|
147
|
0
|
0
|
|
syvayq
|
3
|
reversible_defect
|
145
|
4
|
0
|
0
|
0
|
174
|
2.6
|
1
|
70
|
125
|
1
|
1
|
|
lq4ldx
|
3
|
normal
|
120
|
4
|
1
|
0
|
2
|
246
|
2.2
|
1
|
64
|
96
|
1
|
1
|
|
0zldrz
|
1
|
reversible_defect
|
94
|
3
|
1
|
0
|
0
|
227
|
0.0
|
1
|
51
|
154
|
1
|
0
|
|
98kb5h
|
2
|
normal
|
100
|
4
|
0
|
0
|
2
|
248
|
1.0
|
0
|
58
|
122
|
0
|
0
|
|
1wwvr0
|
1
|
normal
|
130
|
2
|
0
|
0
|
2
|
204
|
1.4
|
0
|
41
|
172
|
0
|
0
|
|
f1ziva
|
1
|
reversible_defect
|
132
|
3
|
2
|
0
|
2
|
224
|
3.2
|
1
|
58
|
173
|
0
|
1
|
|
e3dnw3
|
1
|
reversible_defect
|
125
|
4
|
2
|
0
|
2
|
300
|
0.0
|
1
|
58
|
171
|
0
|
1
|
|
08usun
|
1
|
reversible_defect
|
120
|
4
|
0
|
0
|
0
|
177
|
0.4
|
1
|
65
|
140
|
0
|
0
|
|
fzzkh7
|
1
|
reversible_defect
|
130
|
4
|
2
|
1
|
2
|
256
|
0.0
|
1
|
48
|
150
|
1
|
1
|
|
9upsjl
|
2
|
normal
|
150
|
4
|
0
|
0
|
0
|
244
|
1.4
|
0
|
62
|
154
|
1
|
1
|
|
ebloe5
|
1
|
normal
|
140
|
3
|
0
|
0
|
2
|
235
|
0.0
|
1
|
44
|
180
|
0
|
0
|
|
srm6ut
|
1
|
normal
|
130
|
2
|
0
|
0
|
2
|
219
|
0.0
|
1
|
44
|
188
|
0
|
0
|
|
noxsnw
|
3
|
reversible_defect
|
140
|
4
|
0
|
0
|
0
|
217
|
5.6
|
1
|
55
|
111
|
1
|
1
|
|
471q03
|
2
|
reversible_defect
|
120
|
1
|
0
|
0
|
2
|
193
|
1.9
|
1
|
56
|
162
|
0
|
0
|
|
uvwymz
|
1
|
reversible_defect
|
120
|
2
|
0
|
0
|
0
|
263
|
0.0
|
1
|
44
|
173
|
0
|
0
|
|
7eyvsi
|
2
|
reversible_defect
|
110
|
4
|
0
|
0
|
2
|
167
|
2.0
|
1
|
40
|
114
|
1
|
1
|
|
a946ij
|
2
|
reversible_defect
|
128
|
4
|
3
|
0
|
2
|
216
|
2.2
|
1
|
58
|
131
|
1
|
1
|
|
drdvf9
|
1
|
normal
|
140
|
2
|
2
|
0
|
0
|
195
|
0.0
|
0
|
63
|
179
|
0
|
0
|
|
z8yl4y
|
1
|
reversible_defect
|
140
|
1
|
0
|
0
|
0
|
199
|
1.4
|
1
|
40
|
178
|
1
|
0
|
|
mxabaz
|
2
|
normal
|
134
|
1
|
2
|
0
|
0
|
234
|
2.6
|
1
|
61
|
145
|
0
|
1
|
|
cpqg4x
|
1
|
normal
|
108
|
3
|
0
|
0
|
0
|
243
|
0.0
|
1
|
47
|
152
|
0
|
1
|
|
8c36yw
|
2
|
reversible_defect
|
128
|
3
|
1
|
0
|
2
|
229
|
0.4
|
1
|
57
|
150
|
0
|
1
|
|
x4yp0f
|
1
|
reversible_defect
|
108
|
2
|
0
|
0
|
0
|
309
|
0.0
|
1
|
54
|
156
|
0
|
0
|
|
9at0il
|
3
|
normal
|
125
|
3
|
1
|
0
|
2
|
273
|
0.5
|
1
|
54
|
152
|
0
|
0
|
|
nfag5b
|
2
|
reversible_defect
|
120
|
4
|
0
|
0
|
0
|
198
|
1.6
|
1
|
35
|
130
|
1
|
1
|
|
strmq8
|
1
|
normal
|
112
|
4
|
1
|
0
|
2
|
290
|
0.0
|
1
|
44
|
153
|
0
|
1
|
|
43k3gx
|
2
|
reversible_defect
|
130
|
3
|
1
|
0
|
0
|
263
|
1.2
|
0
|
62
|
97
|
0
|
1
|
|
fz84ac
|
1
|
normal
|
160
|
1
|
0
|
0
|
2
|
273
|
0.0
|
1
|
59
|
125
|
0
|
1
|
|
02cipp
|
1
|
normal
|
140
|
1
|
2
|
0
|
0
|
239
|
1.8
|
0
|
69
|
151
|
0
|
0
|
|
1ennzl
|
1
|
normal
|
130
|
3
|
0
|
0
|
0
|
275
|
0.2
|
0
|
48
|
139
|
0
|
0
|
|
isq8yp
|
1
|
normal
|
120
|
3
|
0
|
0
|
0
|
226
|
0.0
|
1
|
44
|
169
|
0
|
0
|
|
ewckbx
|
2
|
reversible_defect
|
130
|
4
|
2
|
0
|
2
|
206
|
2.4
|
1
|
60
|
132
|
1
|
1
|
|
dtljkq
|
1
|
normal
|
130
|
2
|
0
|
0
|
0
|
266
|
0.6
|
1
|
49
|
171
|
0
|
0
|
|
a2kf1z
|
1
|
reversible_defect
|
117
|
4
|
2
|
1
|
0
|
230
|
1.4
|
1
|
60
|
160
|
1
|
1
|
|
usnkhx
|
3
|
reversible_defect
|
160
|
4
|
3
|
0
|
2
|
164
|
6.2
|
0
|
62
|
145
|
0
|
1
|
|
hltlsl
|
2
|
reversible_defect
|
142
|
4
|
3
|
0
|
2
|
309
|
0.0
|
1
|
45
|
147
|
1
|
1
|
|
l0c19s
|
1
|
reversible_defect
|
142
|
4
|
0
|
0
|
2
|
226
|
0.0
|
1
|
53
|
111
|
1
|
0
|
|
lcexsf
|
1
|
normal
|
152
|
3
|
1
|
0
|
0
|
277
|
0.0
|
0
|
67
|
172
|
0
|
0
|
|
y3m2bd
|
1
|
reversible_defect
|
132
|
4
|
0
|
0
|
0
|
207
|
0.0
|
1
|
57
|
168
|
1
|
0
|
|
qcjf51
|
1
|
reversible_defect
|
120
|
4
|
0
|
0
|
2
|
249
|
0.8
|
1
|
46
|
144
|
0
|
1
|
|
7zbya5
|
3
|
fixed_defect
|
145
|
1
|
0
|
1
|
2
|
233
|
2.3
|
1
|
63
|
150
|
0
|
0
|
|
23gf0e
|
2
|
normal
|
110
|
1
|
0
|
0
|
2
|
211
|
1.8
|
1
|
64
|
144
|
1
|
0
|
|
qhz9ye
|
1
|
reversible_defect
|
150
|
4
|
0
|
0
|
2
|
270
|
0.8
|
1
|
58
|
111
|
1
|
1
|
|
u25507
|
1
|
normal
|
112
|
4
|
1
|
0
|
2
|
212
|
0.1
|
1
|
66
|
132
|
1
|
1
|
|
j9tw19
|
2
|
reversible_defect
|
118
|
4
|
0
|
0
|
0
|
219
|
1.2
|
1
|
39
|
140
|
0
|
1
|
|
5o32oi
|
1
|
reversible_defect
|
140
|
4
|
0
|
0
|
0
|
299
|
1.6
|
1
|
51
|
173
|
1
|
1
|
|
o63ri2
|
1
|
normal
|
140
|
4
|
0
|
0
|
0
|
239
|
1.2
|
1
|
54
|
160
|
0
|
0
|
|
5qfar3
|
2
|
reversible_defect
|
125
|
4
|
2
|
1
|
0
|
254
|
0.2
|
1
|
67
|
163
|
0
|
1
|
|
2s2b1f
|
2
|
normal
|
180
|
4
|
0
|
0
|
1
|
327
|
3.4
|
0
|
55
|
117
|
1
|
1
|
|
nsd00i
|
2
|
reversible_defect
|
125
|
3
|
0
|
0
|
0
|
309
|
1.8
|
1
|
64
|
131
|
1
|
1
|
|
0xw93k
|
1
|
normal
|
124
|
3
|
2
|
1
|
0
|
255
|
0.0
|
1
|
48
|
175
|
0
|
0
|
|
2nx10r
|
1
|
normal
|
160
|
3
|
1
|
0
|
0
|
201
|
0.0
|
0
|
54
|
163
|
0
|
0
|
Description of the dataset
There are 14 columns in the dataset, where the patient_id column is a unique and random identifier. The remaining 13 features are described below.
slope_of_peak_exercise_st_segment (type: int): the slope of the peak exercise ST segment, an electrocardiography read out indicating quality of blood flow to the heart
thal (type: categorical): results of thallium stress test measuring blood flow to the heart, with possible values normal, fixed_defect, reversible_defect
resting_blood_pressure (type: int): resting blood pressure
chest_pain_type (type: int): chest pain type (4 values)
num_major_vessels (type: int): number of major vessels (0-3) colored by flourosopy
fasting_blood_sugar_gt_120_mg_per_dl (type: binary): fasting blood sugar > 120 mg/dl
resting_ekg_results (type: int): resting electrocardiographic results (values 0,1,2)
serum_cholesterol_mg_per_dl (type: int): serum cholestoral in mg/dl
oldpeak_eq_st_depression (type: float): oldpeak = ST depression induced by exercise relative to rest, a measure of abnormality in electrocardiograms
sex (type: binary): 0: female, 1: male
age (type: int): age in years
max_heart_rate_achieved (type: int): maximum heart rate achieved (beats per minute)
exercise_induced_angina (type: binary): exercise-induced chest pain (0: False, 1: True)
Data preparation and preprocessing
Examine class label
As shown in the chart, we have 100 (56%) patients that have no heart disease while 80 (44%) of the patients have heart disease.
theme_set(theme_bw())# The current theme is automatically applied to every plot we draw
disease_data <- disease_data %>% mutate(heart_disease_present= fct_recode(heart_disease_present, absent = "0", present = "1"))
chart <- disease_data %>% count(heart_disease_present) %>% mutate(pct= round(n/sum(n)*100)) %>% ggplot(aes(x = heart_disease_present, y=n, fill= heart_disease_present)) + geom_bar(stat = 'identity', width = 0.4, show.legend = FALSE)+ labs(x='Heart disease present', y='Number of patients', caption = "Source: Heart disease data") +
scale_fill_manual(values = c( "present" = "red", "absent"= "green"), aesthetics = 'fill') +
geom_text(aes(label= str_c(pct, "%")), vjust= 4.5, size=2.5, colour='black') +
theme(legend.position = 'top', axis.title.y = element_text(size = 12, face='bold'), axis.title.x =element_text(size = 12, face='bold'), axis.text.x = element_text(angle = 50, vjust = 0.3, face = 'bold'))
ggplotly(chart, tooltip = c("x", "y"))
Visualize class separation by numeric features
The primary goal of visualization for classification problems is to understand which features are useful for class separation. In this section, we will start by visualizing the separation quality of numeric features.
# If you have column name as a character vector (e.g. col= 'age'), use .data[[col]]. If the column name or expression is supplied by the user, you can pass it to aes() or vars() using {{col}} i.e. curly curly syntax.(This is rlang syntax for writing a function)
plot_box = function(df, cols, col_x='heart_disease_present'){
for(col in cols){
p = ggplot(df, aes(x = .data[[col_x]], y = .data[[col]], fill = .data[[col_x]]))+ geom_boxplot(show.legend = FALSE) +
scale_fill_manual(values = c( "present" = "red", "absent" = "green"), aesthetics = 'fill') +
labs(x='Heart disease present', y= str_c(col),
title=str_c('Box plot of', col, '\n vs.',
col_x, sep = ' '), caption = "Source: Heart disease data") + theme(axis.text.x = element_text(face = 'bold'), axis.title.y = element_text(size = 12, face='bold'), axis.title.x =element_text(size = 12, face='bold'))
print(p)
}
}
num_cols = disease_data %>% select_if(is.numeric) %>% colnames()
plot_box(disease_data, num_cols)






Box plots are useful, since by construction we are forced to focus on the overlap (or not) of the quartiles of the distribution. In this case, we might ask the question like: is there sufficient differences in the quartiles for the feature to be useful in separation the label classes? There are two cases displayed above:
For age, max_heart_rate_achieved,oldpeak_eq_st_depression, num_major_vessels, there is useful separation between absent and present of heart disease patients. As one might expect, older people tends to have heart disease compared to the younger one.
On the other hand, resting_blood_pressure and serum_cholesterol_mg_per_dl does not seem to matter.
Visualizing class separation by categorical features
Now we will turn to the problem of visualizing the ability of categorical features to separate classes of the label. Ideally, a categorical feature will have very different counts of the categories for each of the label values. A good way to visualize these relationships is with bar plots. The code in the cell below creates side by side plots of the categorical variables for each of the labels categories.
# Since the facet_var will be supplied by the user, we pass curly curly syntax to vars() using {{facet_var}}
plot_bars = function(df, cat_cols, facet_var){
for(col in cat_cols){
p = ggplot(df, aes(x = .data[[col]], fill = .data[[col]])) +
geom_bar(show.legend = F) +
labs(x = col, y = "Number of patients", title = str_c('Bar plot of ', col, '\nfor heart disease')) +
facet_wrap(vars({{facet_var}}), scales = 'free_y') +
theme(axis.title.y = element_text(size = 12, face = 'bold'), axis.title.x = element_text(size = 12, face = 'bold'),
axis.text.x = element_text(angle = 90, hjust = 1, face = 'bold'))
print(p)
}
}
cat_cols <- disease_data %>% select_if(is.factor) %>% colnames()
cat_cols <- cat_cols[-8] # removing the class label
plot_bars(disease_data, cat_cols, heart_disease_present)







There is a lot of information in these plots. The key to interpretation of these plots is comparing the proportion of the categories for each of the label values. If these proportions are distinctly different for each label category, the feature is likely to be useful in separating the label.
There are several cases evident in these plots:
Some features such as slope_of_peak_exercise_st_segment, thal, and chest_pain_type have significantly different distribution of categories between the label categories.
Others features such as fasting_blood_sugar_gt_120_mg_per_dl, exercise_induced_angina and sex show small differences, but these differences are unlikely to be significant.
Other feature like resting_ekg_results has a dominant category with very few cases of other categories. These features will likely have very little power to separate the cases.
Notice that only a few of these categorical features will be useful in separating the cases.
Model building
In this secton we will perform two-class classification using binary classifiers. A classifier is a machine learning model that separates the label into categories or classes. In other words, classification models are supervised machine learning models which predict a categorical label. Common examples of classifiers include logistic regression, K Nearest Neighbour (KNN), Support Vector Machines (SVM), Random forest (RF), and Artificial neural network (NN). Some classifiers such as logistic regression, artificial neural network, gaussian process and random forest are known to predict probability of a given instance belonging to a particular class and are therefore called probabilistic classifiers. Classifiers of this nature use statistical inference to categorize the best label for a given instance. A predicted probability can then be converted into a class value by selecting the class label that has the highest probability. Unlike other algorithm such as K nearest neighbour which simply output the best class for a given instance. Classifiers such as logistic regression and adaboost were designed primarily for solving binary class problem and therefore will not work for multi-class problem.
In this case, our machine learning models used heart disease data to determine if a particular patient has heart disease or not. Thus, heart disease of patient is the classes we must predict.
Split the heart disease dataset into training and test datasets
we will create randomly sampled training and test data sets. The createDataPartition() function from the R caret package is used to create indices for the training data sample. In this case 80% of the data will be used for training the model. Since this data set is small, only 36 cases will be included in the test dataset. Execute this code and note the dimensions of the resulting data frame.
## The dimension of the training set is ( 144 14 )
## The dimension of test set is ( 36 14 )
Scale numeric features
Numeric features must be rescaled so they have a similar range of values. Rescaling prevents features from having an undue influence on model training simply because then have a larger range of numeric variables.
The code in the cell below uses the preProcess() function from the caret function. The processing is as follows:
The preprocessing model object is computed. In this case the processing includes centering and scaling the numeric feature. Notice that this model is fit only to the training data.
The scaling is applied to both the test and training partitions.
## Created from 144 samples and 14 variables
##
## Pre-processing:
## - centered (6)
## - ignored (8)
## - scaled (6)
One hot encoding
To convert all the nominal or factor variables to numeric, we need to create a dummy variable for each category of the categorical variables. Only one dummy variable is coded with a one for each set of categories. This is known as one hot encoding. By using numeric dummy variable, the entire training feature array is now numeric. To do this, we use dummyVars() function from the caret package. A predict method is applied to create numeric model matrices for training and test.
FALSE Dummy Variable Object
FALSE
FALSE Formula: ~.
FALSE <environment: 0x000000002d66b408>
FALSE 13 variables, 7 factors
FALSE Variables and levels will be separated by '.'
FALSE A full rank encoding is used
Feature selection
Feature selection can be an important part of model selection. In supervised learning, including features in a model which do not provide information on the label, is useless at best, and may prevent generalization at worst.
Feature selection can involve application of several methods. Two important methods include:
Eliminating features with low variance and zero variance. Zero variance features are comprised of the same values. Low variance features arise from features with most values the same and with few unique values. One way low variance features can arise, is from dummy variables for categories with very few members. The dummy variable will be mostly 0s with very few 1s.
Training machine learning models with features that are uninformative can create a variety of problems. An uninformative feature does not significantly improve model performance. In many cases, the noise in the uninformative features will increase the variance of the model predictions. In other words, uninformative models are likely to reduce the ability of the machine learning model to generalize.
FALSE freqRatio percentUnique zeroVar nzv
FALSE thal.fixed_defect 23 1.388889 FALSE TRUE
FALSE resting_ekg_results.1 143 1.388889 FALSE TRUE
We will remove the features that will not make our models to generalize well on the test and validation datasets.
Evaluation metric
The metric used for the evaluation of the performance of each model is logarithmic loss. In order to calculate Log Loss, the classifier must assign a probability to each class rather than simply yielding the most likely class. Mathematically Log Loss is defined as :
\[\text{Logloss}= - \frac{1}{n} \sum_{i=1}^n \left[ y_i \log(\hat{p}_i) + (1 - y_i) \log(1 - \hat{p}_i)\right]\]
where:
\(n\) - number of observations
\(log\) - the natural logarithm
\(y\) - a binary indicator (\(0\) or \(1\)) of whether class label \(c\) is the correct classification for observation \(o\)
\(p\) - the model’s predicted probability that observation \(o\) is of class \(c\).
Logarithmic loss provides a steep penalty for predictions that are both confident and wrong. That is, it takes into account the uncertainty of our model prediction based on how much it varies from the actual label. Logloss has no upper bound and it exists on the range \([0, \infty)\). Logloss nearer to \(0\) indicates higher accuracy, whereas if the logloss is away from \(0\) then it indicates lower accuracy. In general, the least logloss gives greater accuracy for the classifier. The goal is to minimize the logloss and a perfect classifier would have a logloss of precisely zero while less ideal classifiers have progressively larger values of logloss.
Computational section
The predictive performance of machine learning models depend on the structure of the dataset and proper data preparation will ensure the models work optimally. Since the best machine learning method on dataset cannot be known beforehand, in this section, we consider different catalogs of machine learning algorithms which included parametric and non parametric on heart disease data and we evaluated the performance of each model with logloss metric on test data.
R functions
We wrote two functions such as prob.prediction() and logloss().
prob.prediction(): This function takes yhat.model as input and that enables us to have a dataframe that comprises the probability of model class prediction and the label of the test set.
logloss(): The metric for evaluating performance of each classifier that was used in the heart disease data. This function as two input namely actual which is the true class label from the test set and predicted which is the probability of the classifier’s class prediction.
# Models to consider
# LDA model
lda.model<- train(y ~ ., data = xytrain, method = "lda", trControl = trainControl(method = "cv", number = 10, returnResamp = "all", classProbs = TRUE, summaryFunction = twoClassSummary), metric = "ROC")
yhat.lda <- predict(lda.model, xtest, type = "prob")
lda_data <- prob.prediction(yhat.model = yhat.lda)
logloss_lda <- logloss(lda_data$ytest, lda_data$prob)
# GBM model
GBM.model <- train(y~., data=xytrain, method = "gbm", trControl = trainControl(method = "cv", number = 10, returnResamp = "all", classProbs = TRUE, summaryFunction = twoClassSummary, seeds = vector(mode = "list", length = nrow(xytrain) + 1) %>% lapply(., function(x) 1:20)), metric = "ROC", tuneGrid = expand.grid(interaction.depth = 1:2, shrinkage = .1, n.trees = c(10, 50, 100), n.minobsinnode = 10),verbose = FALSE)
yhat.GBM <- predict(GBM.model, xtest, type = "prob")
GBM_data <- prob.prediction(yhat.model = yhat.GBM)
logloss_GBM <- logloss(GBM_data$ytest, GBM_data$prob)
# SVM model
svm.model <- train(y ~ ., data = xytrain, method = "svmLinear2", trControl = trainControl(method = "cv", number = 10, returnResamp = "all", classProbs = TRUE), tuneGrid = data.frame(cost = c(.25, .5, 1)))
yhat.svm <- predict(svm.model, xtest, type = "prob")
svm_data <- prob.prediction(yhat.model = yhat.svm)
logloss_svm <- logloss(svm_data$ytest, svm_data$prob)
# KNN model
knn.model <- train(y ~.,data=xytrain, method = "knn", trControl=trainControl(method = "cv", number = 10, returnResamp = "all", classProbs = TRUE, summaryFunction = twoClassSummary), metric = "ROC")
yhat.knn <- predict(knn.model, newdata = xtest, type="prob")
knn_data <- prob.prediction(yhat.model = yhat.knn)
logloss_knn <- logloss(knn_data$ytest, knn_data$prob)
# Ctree
ctree.model <- train(y ~ ., data = xytrain, method = "ctree", trControl = trainControl(method = "cv", number = 10, returnResamp = "all"))
yhat.ctree <- predict(ctree.model, newdata = xtest, type="prob")
ctree_data <- prob.prediction(yhat.model = yhat.ctree)
logloss_ctree <- logloss(ctree_data$ytest, ctree_data$prob)
# CART
cart.model <- train(y ~ ., data = xytrain, method = "rpart", trControl = trainControl(method = "cv", number = 10, returnResamp = "all"))
yhat.cart <- predict(cart.model, newdata = xtest, type="prob")
cart_data <- prob.prediction(yhat.model = yhat.cart)
logloss_cart <- logloss(cart_data$ytest, cart_data$prob)
# cforest
cforest.model <- train(y ~ ., data = xytrain, method = "cforest", trControl = trainControl(method = "cv", number = 10, returnResamp = "all", classProbs = TRUE, summaryFunction = twoClassSummary, seeds = vector(mode = "list", length = nrow(xytrain) + 1) %>% lapply(., function(x) 1:20)), metric = "ROC", controls = party::cforest_unbiased(ntree = 20))
yhat.cforest <- predict(cforest.model, newdata = xtest, type="prob")
cforest_data <- prob.prediction(yhat.model = yhat.cforest)
logloss_cforest <- logloss(cforest_data$ytest, cforest_data$prob)
# gausspr model
gausspr.model <- gausspr(y~., data=xytrain)
FALSE Using automatic sigma estimation (sigest) for RBF or laplace kernel
yhat.gausspr <- predict(gausspr.model, xtest, type='prob')
gausspr_data <- prob.prediction(yhat.model = yhat.gausspr)
logloss_gausspr <- logloss(gausspr_data$ytest, gausspr_data$prob)
# rForest
rforest.model <- train(y ~ ., data = xytrain, method = "rf", trControl = trainControl(method = "cv", number = 10, returnResamp = "all", classProbs = TRUE,summaryFunction = twoClassSummary, seeds = vector(mode = "list", length = nrow(xytrain) + 1) %>% lapply(., function(x) 1:20)),metric = "ROC", ntree = 20,importance = TRUE)
yhat.rforest <- predict( rforest.model, xtest, type='prob')
rforest_data <- prob.prediction(yhat.model = yhat.rforest)
logloss_rforest <- logloss(rforest_data$ytest, rforest_data$prob)
# Adaboost
adaboost.model <- train(y ~ ., data = xytrain, method = "adaboost", trControl = trainControl(method = "cv", number = 10, returnResamp = "all", classProbs = TRUE, summaryFunction = twoClassSummary, seeds =vector(mode = "list", length = nrow(xytrain) + 1) %>% lapply(., function(x) 1:20)),metric = "ROC")
yhat.adaboost <- predict( adaboost.model, xtest, type='prob')
adaboost_data <- prob.prediction(yhat.model = yhat.adaboost)
logloss_adaboost <- logloss(adaboost_data$ytest, adaboost_data$prob)
# Nnet
nnet.model <- train(y ~ ., data = xytrain, method = "nnet", trControl = trainControl(method = "cv", number = 10, returnResamp = "all"),
trace = FALSE)
yhat.nnet <- predict(nnet.model, xtest, type="prob")
nnet_data <- prob.prediction(yhat.model = yhat.nnet)
logloss_nnet <- logloss(nnet_data$ytest, nnet_data$prob)
# LogitBoost
logit.model <- train(y ~ ., data = xytrain, method = "LogitBoost", trControl = trainControl(method = "cv", number = 10, returnResamp = "all", classProbs = TRUE, summaryFunction = twoClassSummary), metric = "ROC")
yhat.logit <- predict(logit.model, xtest, type="prob")
logit_data <- prob.prediction(yhat.model = yhat.logit)
logloss_logit <- logloss(logit_data$ytest, logit_data$prob)
# NaiveBayes
naiveBayes.model <- train( y ~ ., data = xytrain, method = "naive_bayes", trControl = trainControl(method = "cv", number = 10, returnResamp = "all", classProbs = TRUE, summaryFunction = twoClassSummary), metric = "ROC")
yhat.naiveBayes <- predict(naiveBayes.model, xtest, type='prob')
naiveBayes_data <- prob.prediction(yhat.model = yhat.naiveBayes)
logloss_naiveBayes <- logloss(naiveBayes_data$ytest, naiveBayes_data$prob)
# MARS model
mars.model <- train(y~., data=xytrain, method='earth', trControl=trainControl(method = "cv", number = 10, returnResamp = "all", classProbs = TRUE), tuneGrid = data.frame(degree = 1, nprune = (2:4)*2))
yhat.mars <- predict(mars.model, xtest, type='prob')
mars_data <- prob.prediction(yhat.model = yhat.mars)
logloss_mars <- logloss(mars_data$ytest, mars_data$prob)
# glmnet model
glmnet.model <- train(y~., data=xytrain, method='glmnet', trControl =trainControl(method = "cv", number = 10, returnResamp = "all", classProbs = TRUE, summaryFunction = twoClassSummary),metric = "ROC", tuneGrid = expand.grid(.alpha = seq(.05, 1, length = 15), .lambda = c((1:5)/10)))
yhat.glmnet <- predict(glmnet.model, xtest, type='prob')
glmnet_data <- prob.prediction(yhat.model = yhat.glmnet)
logloss_glmnet <- logloss(glmnet_data$ytest, glmnet_data$prob)
# xgbTree
xgbtree.model <- train(y ~ ., data = xytrain, method = "xgbTree", trControl = trainControl(method = "cv", number = 10, returnResamp = "all", classProbs = TRUE, summaryFunction = twoClassSummary), metric = "ROC", tuneGrid = expand.grid(nrounds = c(1, 10), max_depth = c(1, 4), eta = c(.1, .4), gamma = 0, colsample_bytree = .7, min_child_weight = 1,subsample = c(.8, 1)))
yhat.xgbtree <- predict(xgbtree.model, xtest, type='prob')
xgbtree_data <- prob.prediction(yhat.model = yhat.xgbtree)
logloss_xgbtree <- logloss(xgbtree_data$ytest, xgbtree_data$prob)
j48.model <- train(y ~ ., data = xytrain, method = "J48", trControl = trainControl(method = "cv", number = 10, returnResamp = "all", classProbs = TRUE, summaryFunction = twoClassSummary), metric = "ROC")
yhat.j48 <- predict(j48.model, xtest, type='prob')
j48_data <- prob.prediction(yhat.model = yhat.j48)
logloss_j48 <- logloss(xgbtree_data$ytest, j48_data$prob)
# Models logloss comparision to check for the best model
evaluation <- tibble(lda= logloss_lda, GBM= logloss_GBM, SVM =logloss_svm, KNN = logloss_knn, ctree = logloss_ctree, CART = logloss_cart, cforest = logloss_cforest, `gauss process` = logloss_gausspr, `Random forest` = logloss_rforest, Adaboost = logloss_adaboost, `Neural network` = logloss_nnet, "Logistic" = logloss_logit, Naivebayes = logloss_naiveBayes, MARS = logloss_mars, glmnet = logloss_glmnet, XGBtree =logloss_xgbtree, j48=logloss_j48)
logloss= t(evaluation)
Model = rownames(logloss)
comparision_table= as_tibble(logloss) %>% add_column(Model) %>% rename(logloss="V1") %>% arrange(logloss) %>% add_column(SN= 1:length(Model),.after = 0) %>% dplyr::select(SN, Model, logloss)
Evaluation table
Models performance with logloss metric
|
SN
|
Model
|
logloss
|
|
1
|
glmnet
|
0.8130325
|
|
2
|
Adaboost
|
0.8239399
|
|
3
|
cforest
|
0.8649421
|
|
4
|
gauss process
|
0.8711982
|
|
5
|
XGBtree
|
0.9362978
|
|
6
|
SVM
|
0.9643937
|
|
7
|
MARS
|
0.9955285
|
|
8
|
ctree
|
1.0239794
|
|
9
|
Neural network
|
1.1142865
|
|
10
|
CART
|
1.1233441
|
|
11
|
GBM
|
1.3182116
|
|
12
|
lda
|
1.3716846
|
|
13
|
Logistic
|
1.7355291
|
|
14
|
Random forest
|
1.9371862
|
|
15
|
Naivebayes
|
3.3867352
|
|
16
|
j48
|
5.7350614
|
|
17
|
KNN
|
6.3656432
|
Optimal model
|
SN
|
Model
|
logloss
|
|
1
|
glmnet
|
0.8130325
|
This shows that glmnet is the optimal model.
Generalizing optimal model on validation dataset
We therefore, use the optimal model glmnet for predicting the class label on validation data.
# Validation dataset
validation_data <- read_csv("Data from DRIVENDATA/test_values.csv")
# Data wrangling and preprocessing of validation data
validation_data <- validation_data[-1] # patient_id column dropped
factor_variable_position <- c(1, 2, 4, 6, 7, 10, 13)
validation_data<- validation_data %>% mutate_at(vars(factor_variable_position),as_factor)
# Applying preprocessing to the validation dataset
validation_data <- predict(preProcess_scale_model,newdata = validation_data)
validation_dummy <- predict(dummies, newdata = validation_data)
# Convert to dataframe
xvalidation <- as_tibble(validation_dummy)
# Remove low variance columns on validation set
xvalidation <- xvalidation %>% dplyr::select(-c(resting_ekg_results.1,
thal.fixed_defect))
# Predicting the class label
validation_label <- predict(adaboost.model, xvalidation)
# Convert to data frame
validation_label <- tibble(heart_disease_present=validation_label)
validation_table <- validation_label %>% count(heart_disease_present) %>% mutate(Percent = round(n/sum(n)*100, 2))
Class prediction label
Summary of predicted labels on validation data
|
heart_disease_present
|
n
|
Percent
|
|
absent
|
48
|
53.33
|
|
present
|
42
|
46.67
|
## Parsed with column specification:
## cols(
## patient_id = col_character(),
## heart_disease_present = col_double()
## )
Validation data class prediction using optimal model (Adaboost classifier)
|
patient_id
|
heart_disease_present
|
|
olalu7
|
present
|
|
z9n6mx
|
absent
|
|
5k4413
|
present
|
|
mrg7q5
|
absent
|
|
uki4do
|
present
|
|
kev1sk
|
absent
|
|
9n6let
|
absent
|
|
jxmtyg
|
present
|
|
51s2ff
|
present
|
|
wi9mcs
|
absent
|
|
741h4l
|
absent
|
|
1ef64a
|
absent
|
|
wa2ix6
|
absent
|
|
8167zl
|
present
|
|
n6nldr
|
absent
|
|
ph85fp
|
absent
|
|
jfan5p
|
absent
|
|
7c4iz1
|
absent
|
|
ukigml
|
present
|
|
flwvnq
|
absent
|
|
5i4fw2
|
present
|
|
du1pqf
|
absent
|
|
vs68qz
|
absent
|
|
pfyez0
|
absent
|
|
azvkw2
|
present
|
|
cird1i
|
present
|
|
3bg32t
|
absent
|
|
xzd050
|
absent
|
|
eyi8et
|
present
|
|
ce4x2h
|
absent
|
|
sm91nr
|
present
|
|
2il8hh
|
absent
|
|
yq9cqg
|
present
|
|
520v5j
|
absent
|
|
ammgu2
|
present
|
|
jix8hj
|
absent
|
|
lj5zrq
|
present
|
|
16ceba
|
absent
|
|
93w44s
|
absent
|
|
bso17z
|
present
|
|
j2w2dc
|
present
|
|
74vwwl
|
absent
|
|
0z3fob
|
present
|
|
mr7zyz
|
absent
|
|
pp5n63
|
present
|
|
j0hix1
|
absent
|
|
rn209i
|
absent
|
|
nfit8e
|
present
|
|
nb73sy
|
present
|
|
i79t3w
|
present
|
|
9nv2d9
|
present
|
|
2xbeja
|
absent
|
|
lwg3wq
|
present
|
|
lrvqwb
|
absent
|
|
c6mepo
|
absent
|
|
6ued22
|
absent
|
|
112e9h
|
present
|
|
8jc7h2
|
absent
|
|
unykmj
|
absent
|
|
4yeztb
|
present
|
|
tgpy9u
|
absent
|
|
pf5wp6
|
present
|
|
cj8vj2
|
absent
|
|
9w6d9j
|
present
|
|
3l89wd
|
absent
|
|
83a6x1
|
present
|
|
oua0gr
|
present
|
|
j0hl96
|
present
|
|
dlkzyg
|
present
|
|
r0w4a8
|
absent
|
|
46dlca
|
absent
|
|
9fkefu
|
present
|
|
6uk6kl
|
present
|
|
c7olxr
|
present
|
|
iiyx0q
|
present
|
|
25vetx
|
present
|
|
073vc5
|
present
|
|
18abn0
|
absent
|
|
v5fsfs
|
absent
|
|
2ekoo7
|
absent
|
|
5bbknr
|
present
|
|
hr6pjx
|
absent
|
|
r4hsar
|
absent
|
|
4cezdf
|
absent
|
|
palhcc
|
present
|
|
bwoyg6
|
absent
|
|
j8i7ve
|
present
|
|
t2zn1n
|
absent
|
|
oxf8kj
|
present
|
|
aeiv0y
|
absent
|
Conclusion
This study considered \(17\) different catalogs of machine learning models which are carefully selected from the set of parametric and/or non-parametric models to choose the best optimal model with the least logloss. caret package was used to tune our different models and the optimal model glmnet predicted (48, 53.33%) has absent while (42, 46.67%) were predicted to have present of heart disease on the validation dataset.
If you like this writeup, you can also follow me on Twitter and Linkedin for more updates in R and Python for datascience.
---
title: "Machine Learning with a Heart"
subtitle: "Predicting Heart Disease"
author: |
  | [Ogundepo Ezekiel Adebayo](https://bit.ly/gbganalyst)
  | [I'm on Twitter](https://twitter.com/gbganalyst)

date: "`r format(Sys.time(), '%B %d, %Y')`"
output: 
  html_document:
    theme: united
    highlight: espresso
    toc: true
    number_sections: true
    toc_depth: 3
    toc_float: true
    code_download: true
editor_options: 
  chunk_output_type: console
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

Heart disease refers to several types of heart conditions and it is the [number one cause of death worldwide](www.world-heart-federation.org/resources/cardiovascular-diseases-cvds-global-facts-figures/). To prevent heart disease, we must first learn how to reliably detect it. The heart disease data used in this study has various measurements on patients health and cardiovascular statistics.

![Heart disease](Images/Heart_disease.PNG)

Source: UChicago Medicine

# Source of data

This study used dataset from a study of heart disease that has been open to the public at the [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml/index.php) which is being maintained by the Center for Machine Learning and Intelligent Systems at the University of California, Irvine.


## Import packages

We shall use different set of packages in R

* Data preparation and exploration
* Machine learning packages
* Table formating

```{r R_library, message=FALSE, warning=FALSE, comment=FALSE, include=TRUE}

#Import packages

data_exploration_packages <- c("tidyverse", "plotly", "openxlsx")

machine_learning_packages <- c("caret","MASS", "car", "kernlab","rpart","randomForest","class","ada", "rda","e1071", "nnet","ipred", "dbarts", "klaR", "glmnet", 'earth')

table_formating_packages <- c("knitr","kableExtra")


if(!require(install.load)){
  install.packages("install.load")
}

install.load::install_load(c(data_exploration_packages, machine_learning_packages, table_formating_packages))

```


## Load and prepare the dataset

As a first step we must load the dataset. 

```{r Training dataset, message=FALSE, warning=FALSE, comment=FALSE, cache=TRUE,include=TRUE}

# Import dataset

data_values <- read_csv("Data from DRIVENDATA/train_values.csv")
data_labels <- read_csv("Data from DRIVENDATA/train_labels.csv")

# Concatenating the two datasets 

disease_data <- bind_cols(data_values, data_labels[,2])

```


> **Overview of heart disease dataset**

```{r kable1, eval=T, include=T,echo=F}

kable(disease_data, caption = 'Heart disease data', align = rep('c', 15)) %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), fixed_thead = T) %>% scroll_box(width = "900px", height = "500px")
 
```


## Description of the dataset

There are 14 columns in the dataset, where the patient_id column is a unique and random identifier. The remaining 13 features are described below.

* slope_of_peak_exercise_st_segment (type: int): the slope of the peak exercise ST segment, an electrocardiography read out indicating quality of blood flow to the heart

* thal (type: categorical): results of thallium stress test measuring blood flow to the heart, with possible values normal, fixed_defect, reversible_defect

* resting_blood_pressure (type: int): resting blood pressure

* chest_pain_type (type: int): chest pain type (4 values)

* num_major_vessels (type: int): number of major vessels (0-3) colored by flourosopy

* fasting_blood_sugar_gt_120_mg_per_dl (type: binary): fasting blood sugar > 120 mg/dl

* resting_ekg_results (type: int): resting electrocardiographic results (values 0,1,2)

* serum_cholesterol_mg_per_dl (type: int): serum cholestoral in mg/dl

* oldpeak_eq_st_depression (type: float): oldpeak = ST depression induced by exercise relative to rest, a measure of abnormality in electrocardiograms

* sex (type: binary): 0: female, 1: male

* age (type: int): age in years

* max_heart_rate_achieved (type: int): maximum heart rate achieved (beats per minute)

* exercise_induced_angina (type: binary): exercise-induced chest pain (0: False, 1: True)


# Data preparation and preprocessing

## Transform column data type

There are eight columns in this dataset which do not have the correct data type as expected. This is a common situation, as the methods used to automatically determine data type when loading files can fail sometimes. The code in the cell below converted the variables into the right format i.e. factors. 


```{r heart disease dataset, message=FALSE, warning=TRUE, comment=FALSE, cache=TRUE,include=TRUE}

disease_data <-  disease_data[-1] #  patient_id column was dropped

factor_variable_position <- c(1, 2, 4, 6, 7, 10, 13, 14)

disease_data <- disease_data %>% mutate_at(vars(factor_variable_position),as_factor)

```


The categorical features are now coded well. Additionally, the label is now coded as a binary variable. 

## Examine class label 

As shown in the chart, we have 100 (56%) patients that have no heart disease while 80 (44%) of the patients have heart disease. 

```{r chart 1}

theme_set(theme_bw())# The current theme is automatically applied to every plot we draw

disease_data <- disease_data %>% mutate(heart_disease_present= fct_recode(heart_disease_present, absent = "0", present = "1"))

chart <- disease_data %>% count(heart_disease_present) %>% mutate(pct= round(n/sum(n)*100)) %>% ggplot(aes(x = heart_disease_present, y=n, fill= heart_disease_present)) + geom_bar(stat = 'identity', width = 0.4, show.legend = FALSE)+ labs(x='Heart disease present', y='Number of patients', caption = "Source: Heart disease data") +
  scale_fill_manual(values = c( "present" = "red", "absent"= "green"), aesthetics = 'fill') +
  geom_text(aes(label= str_c(pct, "%")), vjust= 4.5, size=2.5, colour='black') +
theme(legend.position = 'top', axis.title.y = element_text(size = 12, face='bold'), axis.title.x =element_text(size = 12, face='bold'), axis.text.x = element_text(angle = 50, vjust = 0.3, face = 'bold'))

ggplotly(chart, tooltip = c("x", "y"))

```


## Visualize class separation by numeric features

The primary goal of visualization for classification problems is to understand which features are useful for class separation. In this section, we will start by visualizing the separation quality of numeric features. 

```{r chart2}

# If you have column name as a character vector (e.g. col= 'age'), use .data[[col]]. If the column name or expression is supplied by the user, you can pass it to aes() or vars() using {{col}} i.e. curly curly syntax.(This is rlang syntax for writing a function)

plot_box = function(df, cols, col_x='heart_disease_present'){
  for(col in cols){
    p = ggplot(df, aes(x = .data[[col_x]], y = .data[[col]], fill = .data[[col_x]]))+ geom_boxplot(show.legend = FALSE) +
      scale_fill_manual(values = c( "present" = "red", "absent" = "green"), aesthetics = 'fill') +
      labs(x='Heart disease present', y= str_c(col),
       title=str_c('Box plot of', col, '\n vs.',
      col_x, sep = ' '), caption = "Source: Heart disease data")  + theme(axis.text.x = element_text(face = 'bold'), axis.title.y = element_text(size = 12, face='bold'), axis.title.x =element_text(size = 12, face='bold'))
    
  print(p) 
  }
}

num_cols = disease_data %>% select_if(is.numeric) %>% colnames()

plot_box(disease_data, num_cols)    
```


Box plots are useful, since by construction we are forced to focus on the overlap (or not) of the quartiles of the distribution. In this case, we might ask the question like: is there sufficient differences in the quartiles for the feature to be useful in separation the label classes? There are two cases displayed above:

1. For age, max_heart_rate_achieved,oldpeak_eq_st_depression, num_major_vessels, there is useful separation between absent and present of heart disease patients. As one might expect, older people tends to have heart disease compared to the younger one. 

2. On the other hand, resting_blood_pressure and serum_cholesterol_mg_per_dl does not seem to matter. 

## Visualizing class separation by categorical features

Now we will turn to the problem of visualizing the ability of categorical features to separate classes of the label. Ideally, a categorical feature will have very different counts of the categories for each of the label values. A good way to visualize these relationships is with bar plots. The code in the cell below creates side by side plots of the categorical variables for each of the labels categories.

```{r chart3}

# Since the facet_var will be supplied by the user, we pass curly curly syntax to vars() using {{facet_var}}

plot_bars = function(df, cat_cols, facet_var){
   for(col in cat_cols){
    p = ggplot(df, aes(x = .data[[col]], fill = .data[[col]])) + 
      geom_bar(show.legend = F)  +
      labs(x = col, y = "Number of patients", title = str_c('Bar plot of ', col, '\nfor heart disease')) +
      facet_wrap(vars({{facet_var}}), scales = 'free_y') +
     theme(axis.title.y = element_text(size = 12, face = 'bold'), axis.title.x = element_text(size = 12, face = 'bold'),
            axis.text.x = element_text(angle = 90, hjust = 1, face = 'bold'))
    
    print(p)
    
  }
}

cat_cols <- disease_data %>% select_if(is.factor) %>% colnames()
cat_cols <- cat_cols[-8] # removing the class label

plot_bars(disease_data, cat_cols, heart_disease_present)    
```


There is a lot of information in these plots. The key to interpretation of these plots is comparing the proportion of the categories for each of the label values. If these proportions are distinctly different for each label category, the feature is likely to be useful in separating the label.  

There are several cases evident in these plots:

1. Some features such as slope_of_peak_exercise_st_segment, thal, and chest_pain_type have significantly different distribution of categories between the label categories.

2. Others features such as fasting_blood_sugar_gt_120_mg_per_dl, exercise_induced_angina and sex  show small differences, but these differences are unlikely to be significant.

3. Other feature like resting_ekg_results has a dominant category with very few cases of other categories. These features will likely have very little power to separate the cases.

Notice that only a few of these categorical features will be useful in separating the cases.

# Model building

In this secton we will perform **two-class classification** using **binary classifiers**. A classifier is a machine learning model that separates the **label** into categories or **classes**. In other words, classification models are **supervised** machine learning models which predict a categorical label. Common examples of classifiers include logistic regression, K Nearest Neighbour (KNN), Support Vector Machines (SVM), Random forest (RF), and Artificial neural network (NN). Some classifiers such as logistic regression, artificial neural network, gaussian process and random forest are known to predict probability of a given instance belonging to a particular class and are therefore called probabilistic classifiers. Classifiers of this nature use statistical inference to categorize the best label for a given instance. A predicted probability can then be converted into a class value by selecting the class label that has the highest probability. Unlike other algorithm such as K nearest neighbour which simply output the best class for a given instance. Classifiers such as logistic regression and adaboost were designed primarily for solving binary class problem and therefore will not work for multi-class problem.


In this case, our machine learning models used heart disease data to determine if a particular patient has heart disease or not. Thus, heart disease of patient is the classes we must predict. 

## Split the heart disease dataset into training and test datasets

we will create randomly sampled training and test data sets. The `createDataPartition()` function from the R caret package is used  to create indices for the training data sample. In this case 80% of the data will be used  for training the model. Since this data set is small, only 36 cases will be included in the test dataset. Execute this code and note the dimensions of the resulting data frame.

```{r datasplit}

# Create the training and test datasets for disease_dataset

# Step 1: Get row numbers for the training data
partition <- createDataPartition(disease_data$heart_disease_present, p=0.8, list=FALSE)

# Step 2: Create the training  dataset
train_diseaseData <- disease_data[partition,] # Create the training sample

cat("The dimension of the training set is (",dim(train_diseaseData),")")

# Step 3: Create the test dataset
test_diseaseData <- disease_data[-partition,] # Create the test sample

cat("The dimension of test set is (", dim(test_diseaseData),")")

```

## Scale numeric features

Numeric features must be rescaled so they have a similar range of values. Rescaling prevents features from having an undue influence on model training simply because then have a larger range of numeric variables. 

The code in the cell below uses the `preProcess()` function from the caret function. The processing is as follows:

1. The preprocessing model object is computed. In this case the processing includes centering and scaling the numeric feature. Notice that this model is fit only to the training data.

2. The scaling is applied to both the test and training partitions.

```{r Scaling}

# Scaling the continuous variables

preProcess_scale_model  <- preProcess(train_diseaseData, method = c("center", "scale"))

# Here is what preProcess_scale_model does.
# It only normalize the 5 continuous variables
print(preProcess_scale_model)

train_diseaseData = predict(preProcess_scale_model, train_diseaseData)

test_diseaseData = predict(preProcess_scale_model, test_diseaseData)
```


## One hot encoding

To convert all the nominal or factor variables to numeric, we need to create a **dummy variable** for each category of the categorical variables. Only one dummy variable is coded with a one for each set of categories. This is known as **one hot encoding**. By using numeric dummy variable, the entire training feature array is now numeric. To do this, we use `dummyVars()` function from the caret package. A predict method is applied to create numeric model matrices for training and test. 

```{r One hot encoding, message=F, comment=F, warning=F}

# Removing the class column on train data to be able to create a one hot encoding

xtrain= train_diseaseData[-length(train_diseaseData)]

# `fullRank = T` to avoid dummmy trap

dummies <- dummyVars( "~.", data = xtrain, fullRank = T)

# Here is what `dummyVars()` does.
# It created `one hot encoding` to the nominal variables

print(dummies)

xtrain_dummy <-  predict(dummies, newdata = xtrain)

# Convert to dataframe

xtrain <- as_tibble(xtrain_dummy)

# Apply onehot to `test` data

# Removing the class column on test data

xtest= test_diseaseData[-length(test_diseaseData)]

xtest_dummy <- predict(dummies, newdata= xtest)

# Convert to dataframe

xtest <- as_tibble(xtest_dummy)

```

## Feature selection

**Feature selection** can be an important part of model selection. In supervised learning, including features in a model which do not provide information on the label, is useless at best, and may prevent generalization at worst.

Feature selection can involve application of several methods. Two important methods include:
  
1. Eliminating features with **low variance** and **zero variance**. Zero variance features are comprised of the same values. Low variance features arise from features with most values the same and with few unique values. One way low variance features can arise, is from dummy variables for categories with very few members. The dummy variable will be mostly 0s with very few 1s. 

2. Training machine learning models with features that are **uninformative** can create a variety of problems. An uninformative feature does not significantly improve model performance. In many cases, the noise in the uninformative features will increase the variance of the model predictions. In other words, uninformative models are likely to reduce the ability of the machine learning model to generalize.   

```{r Feature selection, comment=F, message=F}

# Eliminate low variance features

near_zero = nearZeroVar(xtrain, freqCut = 95/5, uniqueCut = 10, saveMetrics = TRUE)

low_variance_cols <- near_zero[(near_zero$zeroVar == TRUE) | (near_zero$nzv == TRUE), ]

print(low_variance_cols) 

```

We will remove the features that will not make our models to generalize well on the test and validation datasets.


```{r low_variance_cols}

# Remove low variance columns on train set

xtrain <- xtrain %>% dplyr::select(-c(resting_ekg_results.1, thal.fixed_defect))

# Appending Y to the `xtrainDummy` dataset

xytrain <- bind_cols(xtrain, y=train_diseaseData$heart_disease_present)

# Remove low variance columns on test set

xtest <- xtest %>% dplyr::select(-c(resting_ekg_results.1, thal.fixed_defect))

# Appending Y to the `xtestDummy` dataset

xytest <- bind_cols(xtest, y=test_diseaseData$heart_disease_present)

```



```{r variable stored}

# Store X and Y for later use.
xtrain <- xtrain 
ytrain <- xytrain$y

xtest <- xtest
ytest <- xytest$y

ntr <- nrow(xytrain)
nte <- nrow(xytest)

```


## Evaluation metric

The metric used for the evaluation of the performance of each model is logarithmic loss. In order to calculate Log Loss, the classifier must assign a probability to each class rather than simply yielding the most likely class. Mathematically Log Loss is defined as :

$$\text{Logloss}= - \frac{1}{n} \sum_{i=1}^n \left[ y_i \log(\hat{p}_i) + (1 - y_i) \log(1 - \hat{p}_i)\right]$$

where:

$n$ - number of observations

$log$ - the natural logarithm

$y$ - a binary indicator ($0$ or $1$) of whether class label $c$ is the correct classification for observation $o$

$p$ - the model's predicted probability that observation $o$ is of class $c$.

Logarithmic loss provides a steep penalty for predictions that are both confident and wrong. That is, it takes into account the uncertainty of our model prediction based on how much it varies from the actual label. Logloss has no upper bound and it exists on the range $[0, \infty)$. Logloss nearer to $0$ indicates higher accuracy, whereas if the logloss is away from $0$ then it indicates lower accuracy. In general, the least logloss gives greater accuracy for the classifier. The goal is to minimize the logloss and a perfect classifier would have a logloss of precisely zero while less ideal classifiers have progressively larger values of logloss.


## Computational section

The predictive performance of machine learning models depend on the structure of the dataset and proper data preparation will ensure the models work optimally. Since the best machine learning method on dataset cannot be known beforehand, in this section, we consider different catalogs of **machine learning** algorithms which included **parametric** and **non parametric** on **heart disease** data and we evaluated the performance of each model with **logloss** metric on **test** data.

### R functions

We wrote two functions such as `prob.prediction()` and `logloss()`.

`prob.prediction()`: This function takes **yhat.model** as input and that enables us to have a dataframe that comprises the probability of model class prediction and the label of the test set.

`logloss()`: The metric for evaluating performance of each classifier that was used in the heart disease data. This function as two input namely **actual** which is the true class label from the test set and **predicted** which is the probability of the classifier's class prediction.


```{r R_Function, message= F, comment= F, warning= F}

# Probability of class prediction function to get the probability of the model prediction for the class labels

prob.prediction <- function(yhat.model){
  as_tibble(yhat.model) %>% mutate(prob=if_else(absent > present, absent, present)) %>% add_column(ytest= as.numeric(if_else(ytest == "absent", 0, 1))) 
}


# Logloss metric for evaluating performance of classifier

logloss = function(actual, predicted, eps = 1e-15) {
  yhat = pmin(pmax(predicted, eps), 1-eps)
  logloss <-  - (mean(actual * log(yhat) + (1 - actual) * log(1 - yhat)))
  return(logloss)
}

```

```{r model building, message= F, comment= F, warning= F}

# Models to consider


# LDA model

lda.model<- train(y ~ ., data = xytrain, method = "lda", trControl = trainControl(method = "cv", number = 10, returnResamp = "all", classProbs = TRUE, summaryFunction = twoClassSummary), metric = "ROC")

yhat.lda <- predict(lda.model, xtest, type = "prob")

lda_data <- prob.prediction(yhat.model = yhat.lda)

logloss_lda <- logloss(lda_data$ytest, lda_data$prob)

# GBM model

GBM.model <- train(y~., data=xytrain, method = "gbm", trControl = trainControl(method = "cv", number = 10, returnResamp = "all", classProbs = TRUE, summaryFunction = twoClassSummary, seeds = vector(mode = "list", length = nrow(xytrain) + 1) %>% lapply(., function(x) 1:20)), metric = "ROC", tuneGrid = expand.grid(interaction.depth = 1:2, shrinkage = .1, n.trees = c(10, 50, 100), n.minobsinnode = 10),verbose = FALSE)

yhat.GBM <- predict(GBM.model, xtest, type = "prob")

GBM_data <- prob.prediction(yhat.model = yhat.GBM)

logloss_GBM <- logloss(GBM_data$ytest, GBM_data$prob)


# SVM model

svm.model <- train(y ~ ., data = xytrain, method = "svmLinear2", trControl = trainControl(method = "cv", number = 10, returnResamp = "all", classProbs = TRUE), tuneGrid = data.frame(cost = c(.25, .5, 1)))

yhat.svm <- predict(svm.model, xtest, type = "prob")

svm_data <- prob.prediction(yhat.model = yhat.svm)

logloss_svm <- logloss(svm_data$ytest, svm_data$prob)


# KNN model

knn.model <- train(y ~.,data=xytrain, method = "knn", trControl=trainControl(method = "cv", number = 10, returnResamp = "all", classProbs = TRUE, summaryFunction = twoClassSummary), metric = "ROC")

yhat.knn <- predict(knn.model, newdata = xtest, type="prob")

knn_data <- prob.prediction(yhat.model = yhat.knn)

logloss_knn <- logloss(knn_data$ytest, knn_data$prob)


# Ctree

ctree.model <- train(y ~ ., data = xytrain, method = "ctree",  trControl = trainControl(method = "cv", number = 10, returnResamp = "all"))

yhat.ctree <- predict(ctree.model, newdata = xtest, type="prob")


ctree_data <- prob.prediction(yhat.model = yhat.ctree)

logloss_ctree <- logloss(ctree_data$ytest, ctree_data$prob)


# CART

cart.model <- train(y ~ ., data = xytrain, method = "rpart",  trControl = trainControl(method = "cv", number = 10, returnResamp = "all"))

yhat.cart <- predict(cart.model, newdata = xtest, type="prob")


cart_data <- prob.prediction(yhat.model = yhat.cart)

logloss_cart <- logloss(cart_data$ytest, cart_data$prob)


# cforest

cforest.model <- train(y ~ ., data = xytrain, method = "cforest", trControl = trainControl(method = "cv", number = 10, returnResamp = "all", classProbs = TRUE, summaryFunction = twoClassSummary, seeds = vector(mode = "list", length = nrow(xytrain) + 1) %>% lapply(., function(x) 1:20)), metric = "ROC", controls = party::cforest_unbiased(ntree = 20))


yhat.cforest <- predict(cforest.model, newdata = xtest, type="prob")

cforest_data <- prob.prediction(yhat.model = yhat.cforest)

logloss_cforest <- logloss(cforest_data$ytest, cforest_data$prob)


# gausspr model

gausspr.model <- gausspr(y~., data=xytrain)
yhat.gausspr <- predict(gausspr.model, xtest, type='prob')

gausspr_data <- prob.prediction(yhat.model = yhat.gausspr)

logloss_gausspr <- logloss(gausspr_data$ytest, gausspr_data$prob)

# rForest

rforest.model <- train(y ~ ., data = xytrain, method = "rf", trControl = trainControl(method = "cv", number = 10, returnResamp = "all", classProbs = TRUE,summaryFunction = twoClassSummary, seeds = vector(mode = "list", length = nrow(xytrain) + 1) %>% lapply(., function(x) 1:20)),metric = "ROC", ntree = 20,importance = TRUE)

yhat.rforest <- predict( rforest.model, xtest, type='prob')

rforest_data <- prob.prediction(yhat.model = yhat.rforest)

logloss_rforest <- logloss(rforest_data$ytest, rforest_data$prob)


# Adaboost

adaboost.model <- train(y ~ ., data = xytrain, method = "adaboost", trControl = trainControl(method = "cv", number = 10, returnResamp = "all", classProbs = TRUE, summaryFunction = twoClassSummary, seeds =vector(mode = "list", length = nrow(xytrain) + 1) %>% lapply(., function(x) 1:20)),metric = "ROC")


yhat.adaboost <- predict( adaboost.model, xtest, type='prob')

adaboost_data <- prob.prediction(yhat.model = yhat.adaboost)

logloss_adaboost <- logloss(adaboost_data$ytest, adaboost_data$prob)

# Nnet

nnet.model <- train(y ~ ., data = xytrain, method = "nnet", trControl = trainControl(method = "cv", number = 10, returnResamp = "all"),
trace = FALSE)

yhat.nnet <- predict(nnet.model, xtest, type="prob")

nnet_data <- prob.prediction(yhat.model = yhat.nnet)

logloss_nnet <- logloss(nnet_data$ytest, nnet_data$prob)


# LogitBoost

logit.model <- train(y ~ ., data = xytrain, method = "LogitBoost", trControl = trainControl(method = "cv", number = 10, returnResamp = "all", classProbs = TRUE, summaryFunction = twoClassSummary), metric = "ROC")

yhat.logit <- predict(logit.model, xtest, type="prob")

logit_data <- prob.prediction(yhat.model = yhat.logit)

logloss_logit <- logloss(logit_data$ytest, logit_data$prob)


# NaiveBayes

naiveBayes.model <- train( y ~ ., data = xytrain, method = "naive_bayes", trControl = trainControl(method = "cv", number = 10, returnResamp = "all", classProbs = TRUE, summaryFunction = twoClassSummary), metric = "ROC")

yhat.naiveBayes <- predict(naiveBayes.model, xtest, type='prob')

naiveBayes_data <- prob.prediction(yhat.model = yhat.naiveBayes)

logloss_naiveBayes <- logloss(naiveBayes_data$ytest, naiveBayes_data$prob)

# MARS model

mars.model <- train(y~., data=xytrain, method='earth', trControl=trainControl(method = "cv", number = 10, returnResamp = "all", classProbs = TRUE),   tuneGrid = data.frame(degree = 1, nprune = (2:4)*2))

yhat.mars  <- predict(mars.model, xtest, type='prob')

mars_data <- prob.prediction(yhat.model = yhat.mars)

logloss_mars <- logloss(mars_data$ytest, mars_data$prob)

# glmnet model

glmnet.model <- train(y~., data=xytrain, method='glmnet',   trControl =trainControl(method = "cv", number = 10, returnResamp = "all", classProbs = TRUE, summaryFunction = twoClassSummary),metric = "ROC", tuneGrid = expand.grid(.alpha = seq(.05, 1, length = 15), .lambda = c((1:5)/10)))

yhat.glmnet  <- predict(glmnet.model, xtest, type='prob')

glmnet_data <- prob.prediction(yhat.model = yhat.glmnet)

logloss_glmnet <- logloss(glmnet_data$ytest, glmnet_data$prob)


# xgbTree

xgbtree.model <- train(y ~ ., data = xytrain, method = "xgbTree", trControl = trainControl(method = "cv", number = 10, returnResamp = "all", classProbs = TRUE, summaryFunction = twoClassSummary), metric = "ROC", tuneGrid =  expand.grid(nrounds = c(1, 10), max_depth = c(1, 4), eta = c(.1, .4), gamma = 0, colsample_bytree = .7, min_child_weight = 1,subsample = c(.8, 1)))

yhat.xgbtree <- predict(xgbtree.model, xtest, type='prob')

xgbtree_data <- prob.prediction(yhat.model = yhat.xgbtree)

logloss_xgbtree <- logloss(xgbtree_data$ytest, xgbtree_data$prob)


j48.model <- train(y ~ ., data = xytrain, method = "J48", trControl = trainControl(method = "cv", number = 10, returnResamp = "all", classProbs = TRUE, summaryFunction = twoClassSummary), metric = "ROC")

yhat.j48 <- predict(j48.model, xtest, type='prob')

j48_data <- prob.prediction(yhat.model = yhat.j48)

logloss_j48 <- logloss(xgbtree_data$ytest, j48_data$prob)

```

```{r loglos_metric, message=F, comment=F, warning=F}

# Models logloss comparision to check for the best model

evaluation <- tibble(lda= logloss_lda, GBM= logloss_GBM, SVM =logloss_svm, KNN = logloss_knn, ctree = logloss_ctree, CART = logloss_cart, cforest = logloss_cforest, `gauss process` = logloss_gausspr, `Random forest` = logloss_rforest, Adaboost = logloss_adaboost, `Neural network` = logloss_nnet, "Logistic" = logloss_logit, Naivebayes = logloss_naiveBayes, MARS = logloss_mars, glmnet = logloss_glmnet, XGBtree =logloss_xgbtree, j48=logloss_j48)

logloss= t(evaluation)

Model = rownames(logloss)

comparision_table= as_tibble(logloss) %>% add_column(Model) %>% rename(logloss="V1") %>%  arrange(logloss) %>% add_column(SN= 1:length(Model),.after = 0) %>% dplyr::select(SN, Model, logloss)

```

> Evaluation table

```{r evaluation_table, eval= T, include= T, echo= F}

kable(comparision_table,align = c('c', 'c', 'c'), caption = "Models performance with logloss metric") %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

```

```{r minlogloss}
# Model with minimin logloss

minlogloss <- comparision_table %>% filter(logloss == min(logloss))

kable(minlogloss, caption = 'Optimal model', align = c('c', 'c')) %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

```

This shows that **`r minlogloss[,2]`** is the optimal model.


# Generalizing optimal model on validation dataset

We therefore, use the optimal model **`r minlogloss[,2]`** for predicting the class label on `validation` data.

```{r validation dataset, message=FALSE, warning=TRUE, comment=FALSE, cache=TRUE,include=TRUE}

# Validation dataset

validation_data <- read_csv("Data from DRIVENDATA/test_values.csv")

# Data wrangling and preprocessing of validation data

validation_data <- validation_data[-1] # patient_id column dropped

factor_variable_position <- c(1, 2, 4, 6, 7, 10, 13)

validation_data<- validation_data %>% mutate_at(vars(factor_variable_position),as_factor)


# Applying preprocessing to the validation dataset

validation_data <-  predict(preProcess_scale_model,newdata = validation_data)

validation_dummy <-  predict(dummies, newdata = validation_data)

# Convert to dataframe

xvalidation <- as_tibble(validation_dummy)

# Remove low variance columns on validation set

xvalidation <- xvalidation %>% dplyr::select(-c(resting_ekg_results.1,
thal.fixed_defect))

# Predicting the class label

validation_label <- predict(adaboost.model, xvalidation)

# Convert to data frame

validation_label <- tibble(heart_disease_present=validation_label)

validation_table <- validation_label %>% count(heart_disease_present) %>% mutate(Percent = round(n/sum(n)*100, 2))

```

## Class prediction label

```{r Predicted_labels, eval=T, include=T, echo=F}
# Counting the number of each predicted class

kable(validation_table, caption = 'Summary of predicted labels on validation data', align = c('l','c')) %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width = "900px", height = "10px")
```


```{r submit_format}

# Validation label prediction

submission_format <- read_csv("Data from DRIVENDATA/submission_format.csv")

submission_format <- submission_format %>% mutate(heart_disease_present=validation_label$heart_disease_present) 
```


```{r Validation class labels, eval=T, include=T, echo=F}
# Counting the number of each predicted class

kable(submission_format, caption = 'Validation data class prediction using optimal model (Adaboost classifier)', align = c('l','c')) %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), fixed_thead = T) %>% scroll_box(width = "900px", height = "400px")


```

# Conclusion

This study considered $17$ different catalogs of machine learning models which are carefully selected from the set of parametric and/or non-parametric models to choose the best optimal model with the least **logloss**. `caret` package was used to tune our different models and the optimal model **`r minlogloss[,2]`** predicted (`r validation_table[1,2]`, `r paste0( validation_table[1,3], "%")`) has `r validation_table[1,1]` while (`r validation_table[2,2]`, `r paste0( validation_table[2,3], "%")`) were predicted to have `r validation_table[2,1]` of heart disease on the validation dataset.

---

If you like this writeup, you can also follow me on [Twitter](https://www.twitter.com/gbganalyst){target="_blank"} and [Linkedin](https://www.linkedin.com/in/ezekiel-ogundepo/){target="_blank"} for more updates in `R` and `Python` for datascience.
